Predicting and understanding human action decisions during skillful joint-action using supervised machine learning and explainable-AI

This study investigated the utility of supervised machine learning (SML) and explainable artificial intelligence (AI) techniques for modeling and understanding human decision-making during multiagent task performance. Long short-term memory (LSTM) networks were trained to predict the target selection decisions of expert and novice players completing a multiagent herding task. The results revealed that the trained LSTM models could not only accurately predict the target selection decisions of expert and novice players but that these predictions could be made at timescales that preceded a player’s conscious intent. Importantly, the models were also expertise specific, in that models trained to predict the target selection decisions of experts could not accurately predict the target selection decisions of novices (and vice versa). To understand what differentiated expert and novice target selection decisions, we employed the explainable-AI technique SHapley Additive exPlanations (SHAP) to identify which informational features (variables) most influenced model predictions. The SHAP analysis revealed that experts were more reliant than novices on information about target direction of heading and the location of coherders (i.e., other players). The implications and assumptions underlying the use of SML and explainable-AI techniques for investigating and understanding human decision-making are discussed.


Expert and novice herding performance
As in [1,2], the herding performance of players was assessed using the following five measures. (1) Gathering time, $t_g \in [0, T]$, which is the time at which all the passive agents are within the containment area for the first time. (2) Distance traveled by the herders, $d_g$, which is the mean distance (in meters) traveled by the herders during the time interval $[0, t_g]$. (3) Herd distance from the containment region, $D_g$, which captures the herders' ability to keep the herd close to the containment area and is calculated with respect to the center of the containment area. A smaller average distance indicates a better ability of the herders to keep the herd close to the containment region. (4) Herd spread, $S_g$, which measures the scatter of the herd within the game field during the time interval $[0, t_g]$. Lower values correspond to a more cohesive herd and consequently better herding performance. The herd spread is evaluated with respect to the area of the containment region, $A_{cr} = \pi r^2$, as $S_{g,\%} = (S_g / A_{cr}) \cdot 100$. (5) Containment rate, $I_\%$, which measures the herders' ability to relocate one or more target agents inside the containment region. It is defined as the mean over time of the percentage of agents in the containment area during the time interval $[0, t_g]$. Performance was assessed with respect to the 48 expert and 40 novice data trials employed for model training and testing. The average and SD for each measure as a function of expertise are reported in Table 1, with experts performing better than novices with regard to all measures. More specifically, Kruskal-Wallis tests revealed significant differences between novice and expert pairs for gathering time $t_g$ ($\chi^2$ = 24.67, p < 0.0001), distance traveled $d_g$ ($\chi^2$ = 5.76, p < 0.02), and the average distance of the herd from the containment region $D_g$ ($\chi^2$ = 24.33, p < 0.0001).
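The definitions above can be made concrete with a minimal sketch. It is not the authors' analysis code: the function names, the data layout (lists of (x, y) tuples per sample), the circular containment region centered at the origin, and the choice to fall back to the trial end when the herd is never fully contained are all assumptions made for illustration. Herd spread $S_g$ is taken as a precomputed area, since the text does not specify how it is measured.

```python
import math

def herding_measures(herd_xy, herder_xy, radius, dt_s, center=(0.0, 0.0)):
    """Return (t_g, d_g, D_g, I_pct) for one trial.

    herd_xy[t][i] and herder_xy[t][j] are (x, y) positions in meters of
    passive agent i and herder j at sample t; dt_s is the sample period (s).
    """
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def inside(p):
        return dist(p, center) <= radius

    last = len(herd_xy) - 1
    # Index of the first sample at which every passive agent is contained.
    # Assumption: if that never happens, the trial end is used (t_g = T).
    g = next((t for t in range(last + 1) if all(inside(p) for p in herd_xy[t])), last)
    t_g = g * dt_s

    # d_g: path length of each herder over [0, t_g], averaged over herders.
    n_herders = len(herder_xy[0])
    paths = [sum(dist(herder_xy[t][j], herder_xy[t + 1][j]) for t in range(g))
             for j in range(n_herders)]
    d_g = sum(paths) / n_herders

    # D_g: distance to the containment center, averaged over agents and samples.
    per_step = [sum(dist(p, center) for p in herd_xy[t]) / len(herd_xy[t])
                for t in range(g + 1)]
    D_g = sum(per_step) / len(per_step)

    # I_%: percentage of contained agents, averaged over samples in [0, t_g].
    pct = [100.0 * sum(inside(p) for p in herd_xy[t]) / len(herd_xy[t])
           for t in range(g + 1)]
    I_pct = sum(pct) / len(pct)
    return t_g, d_g, D_g, I_pct

def spread_percent(S_g, radius):
    """S_{g,%} = S_g / A_cr * 100, with A_cr = pi * r^2."""
    return S_g / (math.pi * radius ** 2) * 100.0
```

Averaging $D_g$ over individual agents rather than over the herd's center of mass is one possible reading of the definition; the other reading would only change the `per_step` line.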

Inter-target movement times
For each successful trial, the inter-target movement times of expert and novice herders were determined by calculating the difference between the time a herder began influencing its current target and the time the herder stopped influencing the previous target. Figure 1 reports the distribution of inter-target movement times for both expert and novice pairs. The average inter-target movement time was 556 ms for novices (with 65% of the total inter-target movement times ≤ 600 ms) and 470 ms for experts (with 72.5% of the total inter-target movement times ≤ 600 ms).
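As a rough illustration of this calculation, the sketch below assumes that each herder's behavior has already been segmented into a time-ordered list of engagements, each a hypothetical `(target_id, start_ms, end_ms)` tuple, and that consecutive engagements involve different targets. The segmentation itself and the function names are not taken from the original analysis.

```python
def inter_target_times(engagements):
    """Time from the end of one engagement to the start of the next (ms)."""
    return [cur[1] - prev[2] for prev, cur in zip(engagements, engagements[1:])]

def summarize(times_ms, threshold_ms=600):
    """Mean movement time and the percentage of times at or below the threshold."""
    mean = sum(times_ms) / len(times_ms)
    pct_below = 100.0 * sum(t <= threshold_ms for t in times_ms) / len(times_ms)
    return mean, pct_below

# Toy data for illustration only (not from the study).
example = [(1, 0, 1000), (2, 1400, 2000), (3, 2700, 3000)]
times = inter_target_times(example)
mean_ms, pct_le_600 = summarize(times)
```

Pooling the per-herder lists before calling `summarize` would give group-level figures of the kind reported above.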

Performance of target selection models with different sequence lengths and prediction horizons
The SML approach presented in the main article can be customized to forecast the ID of the target that will be corralled by a herder for different lengths of state input sequence, $T_{seq}$, and different prediction horizons, $T_{hor}$. Here, $T_{seq}$ corresponded to a time series of relevant state variables with a fixed number of samples, $N_{seq} = 25$, such that $T_{seq}$ is scaled by tuning the sampling time $dt$. The output prediction is the ID of the next target to be corralled by the herder at $T_{hor}$ in the future.

Fig. 1 Distribution of inter-target movement times [ms] of expert (blue) and novice (orange) herders. The average inter-target movement time was 556 ms (65% of the total inter-target movement times < 600 ms) for novices and 470 ms (72.5% of the total inter-target movement times < 600 ms) for experts.
In the main text we reported the accuracy of models trained using $T_{seq}$ = 1 second of system state evolution (i.e., $dt$ = 2 time steps, or 40 ms) and prediction horizons $T_{hor} = 16dt$ and $32dt$, which corresponded to prediction horizons of 640 ms and 1280 ms, respectively. However, we also trained models for $T_{seq}$ = 0.5 and 2 seconds (where $dt$ = 1 and 4 time steps, or 20 and 80 ms, respectively) and for $T_{hor} = 1dt$ and $8dt$. Thus, $T_{hor}$ ranged from 20 to 640 ms for $T_{seq}$ = 0.5 seconds and from 80 ms to 2.56 seconds for $T_{seq}$ = 2 seconds.
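The relationship between $dt$, $T_{seq}$, and $T_{hor}$ is simple arithmetic, sketched below together with one possible way of cutting a single training sample out of a recorded trial. The 20 ms base time step is inferred from the values quoted above ($dt$ = 1 time step corresponding to 20 ms); the function names, the flat list layout of the recording, and the exact window alignment are illustrative assumptions and not the authors' implementation.

```python
BASE_STEP_MS = 20  # one recorded time step (inferred from "dt = 1 ... 20 ms")
N_SEQ = 25         # fixed number of samples per input sequence

def window_ms(dt_steps):
    """Input sequence length T_seq in ms for a sampling time of dt_steps."""
    return N_SEQ * dt_steps * BASE_STEP_MS

def horizon_ms(k, dt_steps):
    """Prediction horizon T_hor = k * dt in ms."""
    return k * dt_steps * BASE_STEP_MS

def make_sample(states, target_ids, end, dt_steps, k):
    """Return (input_sequence, label) for a window ending at index `end`.

    states[t] is the state vector and target_ids[t] the corralled target ID
    at recorded step t. Returns None if the window or label is out of range.
    """
    start = end - (N_SEQ - 1) * dt_steps
    label_idx = end + k * dt_steps
    if start < 0 or label_idx >= len(target_ids):
        return None
    # Take every dt_steps-th state so the sequence always has N_SEQ entries.
    return states[start:end + 1:dt_steps], target_ids[label_idx]

for dt_steps in (1, 2, 4):
    horizons = [horizon_ms(k, dt_steps) for k in (1, 8, 16, 32)]
    print(f"dt={dt_steps}: T_seq={window_ms(dt_steps)} ms, T_hor={horizons} ms")
```

Keeping $N_{seq}$ fixed and subsampling by $dt$ means the network input has the same shape for every $T_{seq}$, which is the design choice the text describes.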
The accuracy values for the different combinations of $T_{seq}$ and prediction horizon $T_{hor}$ are reported in Figure 2. Overall, model accuracy was relatively stable across the different combinations of $T_{seq}$ and $T_{hor}$ for both expert and novice models. Consistent with the results reported in the main text, the models were also expertise specific for all combinations of $T_{seq}$ and $T_{hor}$.
As mentioned in the main text, it is important to understand that when $T_{hor}$ < 600 ms the prediction horizon entailed predicting a target selection decision that had already been made by a herder. Thus, for $dt$ = 1 or 2 time steps, the $T_{hor} = 1dt$ and $8dt$ prediction horizons were of less interest here, as the input data sequence would have included data from the enactment of the already-made target selection decision (i.e., predictions were based on herder state information that already specified the decision). They do, however, provide a benchmark measure of accuracy for the $T_{hor}$ = 640 ms and 1280 ms models presented in the main text and for the $T_{seq}$ = 2 second models with $T_{hor} \geq 8dt$. That is, $T_{hor}$ < 600 ms predictions provide a measure of model accuracy when the target selection decision is potentially well specified in the input data (particularly for $T_{hor} = 1dt$). The fact that the model accuracy for $T_{hor}$ > 600 ms was comparable to that for $T_{hor} \ll 600$ ms illustrates the robustness of

Performance of target selection models with different types of samples
As detailed in the main text, during the time interval $T_{seq}$, a herder could either continuously corral the same target agent or transition between different targets. These were classified as "non-transitioning" and "transitioning" behavioral sequences, respectively. Similarly, at $T_{hor}$, a herder could be corralling the same target agent that was being corralled at the end of $T_{seq}$ or "switch" to a different target agent. These were classified as "non-switching" and "switching" behavioral sequences, respectively. The resultant four data sample types are illustrated in Figure 3 in the main text. Importantly, the number of "switching" samples within the data set used for model training and testing was dependent on $T_{hor}$. Indeed, both novice and expert sample data contained less than 2% transitioning-switching and less than 3% non-transitioning-switching samples when $T_{hor} = 1dt$, and less than 5% transitioning-switching and less than 7% non-transitioning-switching samples when $T_{hor} = 8dt$. The distribution of sample types as a function of $T_{hor}$ is illustrated in Figure 3. Figure 4 details the performance of the trained $LSTM_{NN}$ models on $N_{test} = 2000$ test samples randomly extracted from the different sample types. That is, in contrast to the models presented in the main text, the accuracy values reported in Figure 4 reflect the accuracy of $LSTM_{NN}$ models trained on an unbalanced (representative) set of training samples. Not surprisingly, the accuracy of the models for a specific type of sample was dependent on the proportion of samples of that type within the data set, with the accuracy for switching samples greatly reduced for $T_{hor} = 1dt$, $8dt$, and $16dt$ (particularly when $T_{seq}$ = 0.5 seconds; i.e., when $dt$ = 1 time step, or 20 ms) because of the reduced number of switching samples in the training set. Indeed, there is a direct correspondence between the accuracy reported in Figure 4 and the proportion of a given sample type illustrated in Figure 3; also see Table 2.
It is for this reason that the models reported in the main text were trained on balanced (uniform) data sets (i.e., training sets that included an equal number of each sample type) in order to ensure that model accuracy was independent of sample type. Note, however, that balanced training was never possible for $T_{hor} = 1dt$, as there were never enough switching samples, even when $T_{seq}$ = 2 seconds. For comparative purposes, the accuracy of models trained using unbalanced (representative) and balanced (uniform) training sets for $T_{hor} = 16dt$ (640 ms) and $T_{hor} = 32dt$ (1280 ms), when $T_{seq}$ = 1 second, is detailed in Table 2 and Table 3, respectively. The accuracy values in Table 3 and Mutual Information in Table 4 correspond to those reported in the main text.
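The four sample types and the balanced (uniform) training set can be sketched as follows. This is an illustrative reading of the description above, not the authors' pipeline: the function names are hypothetical, a sample is assumed to carry the sequence of target IDs corralled during $T_{seq}$ plus the target ID at $T_{hor}$, and balancing is done by undersampling every type to a common count.

```python
import random
from collections import Counter, defaultdict

def sample_type(seq_targets, target_at_horizon):
    """Label a sample by behavior during T_seq and at T_hor."""
    transitioning = len(set(seq_targets)) > 1           # more than one target in T_seq
    switching = target_at_horizon != seq_targets[-1]    # target changes by T_hor
    return (("transitioning" if transitioning else "non-transitioning")
            + "-" + ("switching" if switching else "non-switching"))

def type_proportions(samples):
    """Percentage of each sample type in a list of (seq_targets, label) pairs."""
    counts = Counter(sample_type(seq, label) for seq, label in samples)
    return {k: 100.0 * v / len(samples) for k, v in counts.items()}

def balanced_subset(samples, n_per_type=None, seed=0):
    """Draw the same number of samples from every type present in the data."""
    groups = defaultdict(list)
    for seq, label in samples:
        groups[sample_type(seq, label)].append((seq, label))
    smallest = min(len(g) for g in groups.values())
    n = smallest if n_per_type is None else n_per_type
    if n > smallest:
        raise ValueError("not enough samples of the rarest type for a balanced set")
    rng = random.Random(seed)
    subset = []
    for key in sorted(groups):
        subset.extend(rng.sample(groups[key], n))
    rng.shuffle(subset)
    return subset
```

The `ValueError` branch corresponds to the situation described for $T_{hor} = 1dt$, where switching samples are too rare to reach the required count per type.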

Kendall rank correlation coefficients
The ordinal association of SHAP value rankings (the top 10 of which are reported in Tables 6-6) was computed using the Kendall rank correlation coefficient (Kendall's $\tau$) for subgroups of the $N$ top-ranked input features. Table 6 includes the Kendall coefficients, and associated p-values, as a function of expertise. Table 5 includes the Kendall coefficients as a function of the $T_{hor} = 16dt$ and $T_{hor} = 32dt$ prediction horizons for each level of expertise.
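For readers unfamiliar with the statistic, the following is a minimal sketch of Kendall's $\tau$ applied to the top-$N$ features of two rankings. How the subgroups were aligned in the original analysis is not stated, so the approach here is an assumption: the top $N$ features of the first ranking are taken and their positions in both rankings are compared. The feature names are placeholders, the simple tau-a form is used (valid when there are no ties), and p-values are not computed.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a for two equal-length rank lists without ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

def top_n_tau(ranking_a, ranking_b, n):
    """Tau between two feature rankings, restricted to the top n of ranking_a.

    Each ranking is a list of feature names ordered from most to least
    important; every top-n feature of ranking_a must appear in ranking_b,
    and n must be at least 2.
    """
    pos_b = {feature: i for i, feature in enumerate(ranking_b)}
    top = ranking_a[:n]
    return kendall_tau(list(range(len(top))), [pos_b[f] for f in top])

# Placeholder feature names, for illustration only.
rank_a = ["f1", "f2", "f3", "f4"]
rank_b = ["f1", "f2", "f4", "f3"]
tau = top_n_tau(rank_a, rank_b, 4)  # one swapped pair out of six
```

In practice a library routine such as `scipy.stats.kendalltau`, which returns both the coefficient and a p-value, would likely be the more convenient choice.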